1. RESEARCH OBJECTIVES

The purpose of this research project is to develop methods for the
analysis and synthesis of complex fault-tolerant computer systems. It is
motivated by recent rapid developments in large scale and very large scale
integration (LSI and VLSI) technology, especially the introduction of
microprocessors and microcomputers, which are expected to increase greatly
the need for highly reliable computer systems. The research is concerned
with fault diagnosis, reconfiguration and recovery in the event of failures.
Its goals include the development of specific measures of the cost and
complexity of fault tolerance, and the derivation of efficient fault-tolerant
design algorithms based on these measures. The problems associated with the
design of systems containing many microcomputers were studied, with emphasis
on the connecting networks required for fault-tolerant intercomputer
communication. Testing procedures and easily-testable design methods for complex
digital systems employing LSI/VLSI technology were also investigated.
2. RESEARCH ACCOMPLISHMENTS

Major  new  results  were  obtained  in  the  following  areas: 

(1)  Fault-tolerant  interconnection  networks 

(2)  Analysis  of  reconfiguration  and  recovery 

(3)  Bit-sliced  microprocessor  systems 

(4)  Testing  general  LSI/VLSI  systems 

These  results  are  summarized  in  this  section;  detailed  descriptions  can  be 
found  in  the  cited  references. 

2.1 Fault-Tolerant Interconnection Networks [1-2]

A comprehensive study of the fault-tolerance requirements of complex
multicomputer systems, such as systems containing large numbers of microprocessors,
was completed. This work is fully documented in John P. Shen's Ph.D.
dissertation [2]. A survey of the interconnection requirements of multicomputer
systems was carried out, which led to the conclusion that a class of
interconnecting networks called β-networks constitutes one of the most practical
communication structures for such systems. Although the communication
requirements of β-networks have been studied in the past, mainly in the context
of telephone switching systems [3,4], their fault tolerance properties have
received little attention. It was realized early in this study that
interconnecting networks play a critical role in the reliability and fault tolerance
of computer systems.
Fig. 1. (a) A β-network and (b) the corresponding β-graph


A β-network is a connecting network composed of 2x2 crossbar switches,
called β-elements. Fig. 1a shows a simple network composed of four β-elements
which can provide communication between the eight computers denoted
$C_0, C_1, \ldots, C_7$. We have introduced an analytical model called a β-graph
which allows a β-network to be represented by a standard directed graph.
Fig. 1b shows the β-graph that represents the β-network of Fig. 1a. Many of
our results are expressed in terms of β-graphs or other graphs derived from
β-graphs.

In this analysis we assume that faults in β-networks are caused by
failure of the individual β-elements. A β-element has two states during
normal operation, a through (T) and a cross (X) state. A fault may cause
β-elements to become stuck in either the T or the X state. We have developed
a new measure of the fault tolerance of β-networks using a connectivity
criterion called dynamic full access (DFA). A β-network is said to have DFA if
each of its inputs can be connected to any of its outputs by means of a finite
number of passes through the network. Note that the computers provide a set of
feedback paths that allow information to be routed from computer to computer
until the desired destination is reached. Fault tolerance is achieved by
rerouting data transmissions to avoid faulty β-elements or computers.
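
The DFA criterion can be checked mechanically on a small example. The following is a minimal sketch under assumed conventions, not the method of [1, 2]: each computer is modeled as a link from an output port of one β-element to an input port of another, a fault-free element may connect either input to either output over successive passes, and a stuck element provides only its T or X connection. The names `LINKS`, `successors`, `reachable` and `has_dfa`, and the two-element example network, are hypothetical.

```python
from collections import deque

# Each link (one computer's feedback path) is a tuple:
# (source element, source output port, destination element, destination input port).
# Hypothetical example network: two beta-elements, four computers.
LINKS = [
    (0, 0, 1, 0),  # a
    (0, 1, 1, 1),  # b
    (1, 0, 0, 0),  # c
    (1, 1, 0, 1),  # d
]

def successors(link, links, stuck):
    """Links reachable from `link` in one pass through its destination element."""
    _, _, elem, in_port = link
    state = stuck.get(elem)      # None means the element is fault-free
    if state == "T":             # through: input i -> output i
        allowed = {in_port}
    elif state == "X":           # cross: input i -> output 1 - i
        allowed = {1 - in_port}
    else:                        # fault-free: either state can be selected
        allowed = {0, 1}
    return [l for l in links if l[0] == elem and l[1] in allowed]

def reachable(start, links, stuck):
    """All links reachable from `start` in zero or more passes."""
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in successors(queue.popleft(), links, stuck):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def has_dfa(links, stuck):
    """True if every link can reach every other link in finitely many passes."""
    return all(reachable(l, links, stuck) == set(links) for l in links)

if __name__ == "__main__":
    print(has_dfa(LINKS, {}))                 # fault-free network
    print(has_dfa(LINKS, {0: "T"}))           # one stuck element
    print(has_dfa(LINKS, {0: "T", 1: "T"}))   # both elements stuck in T
```

In this toy network a single stuck element leaves DFA intact, while sticking both elements in the T state splits the links into two disjoint loops.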

A fault in a β-network is called critical if it destroys DFA. A minimal
critical fault (MCF) is a critical fault none of whose proper subsets is
critical. We have obtained several complete graph-theoretical
characterizations of the critical faults of a β-network; for details see
[1, 2]. For example, we have shown that a fault f is critical if and only if
the state due to f is not compatible with any state of the β-network that
creates an Eulerian circuit, i.e., a single closed path through all edges,
in the corresponding β-graph.
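
The Eulerian-circuit criterion lends itself to a brute-force check on small networks. The sketch below is illustrative only and rests on assumptions not stated in the source: links are tuples (source element, output port, destination element, input port), a full network state assigns T or X to every element, and a fault is a partial assignment. The names `LINKS`, `is_eulerian_state` and `is_critical` are hypothetical, as is the two-element example.

```python
from itertools import product

# Hypothetical two-element network; each tuple is
# (source element, source output port, destination element, destination input port).
LINKS = [
    (0, 0, 1, 0),  # a
    (0, 1, 1, 1),  # b
    (1, 0, 0, 0),  # c
    (1, 1, 0, 1),  # d
]

def is_eulerian_state(links, state):
    """True if the full state chains all links into one closed path."""
    out_link = {(l[0], l[1]): l for l in links}  # output port -> link leaving it
    current, visited = links[0], 0
    while True:
        _, _, elem, in_port = current
        out_port = in_port if state[elem] == "T" else 1 - in_port
        current = out_link[(elem, out_port)]
        visited += 1
        if current == links[0]:      # a full state is a permutation, so we return
            break
    return visited == len(links)

def is_critical(links, fault):
    """A fault (element -> stuck state) is critical iff no compatible full
    state of the network forms an Eulerian circuit."""
    elements = sorted({l[0] for l in links} | {l[2] for l in links})
    free = [e for e in elements if e not in fault]
    for choice in product("TX", repeat=len(free)):
        state = {**fault, **dict(zip(free, choice))}
        if is_eulerian_state(links, state):
            return False
    return True

if __name__ == "__main__":
    print(is_critical(LINKS, {0: "T"}))          # compatible Eulerian state exists
    print(is_critical(LINKS, {0: "T", 1: "T"}))  # no Eulerian completion
```

The enumeration is exponential in the number of fault-free elements, so it is only practical for very small examples.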


A β-network is defined to be k-fault tolerant or k-FT if the failure
of any k or fewer β-elements does not destroy DFA. The largest k for which
a β-network is k-FT is called the fault tolerance (FT) parameter of the
β-network. In the synthesis of practical fault tolerant β-networks, network
performance must also be considered. A performance criterion called the
communication delay (CD) parameter was introduced, which is defined as the
worst possible transmission delay through the β-network in terms of the number
of intervening β-elements between any pair of communicating devices. It was
proven that the FT parameter k and CD parameter d of any β-network with n
β-elements must satisfy the following fundamental bounds:

$$0 \le k \le n - 1$$

$$\lfloor \log_2 n \rfloor + 1 \le d \le n$$

It  has  also  been  shown  that  these  bounds  are  tight. 

The design of fault tolerant β-networks for multicomputer systems
typically involves striking a balance between fault tolerance and communication
delay. Two β-network designs were obtained which possess extreme values for
k and d. The modified inverse shuffle-exchange (MISE) β-network was shown to
have FT parameter k = 1 and CD parameter $d = \lfloor \log_2 n \rfloor + 1$.
Another β-network called the double parallel ring (DPR) network was shown to
have FT parameter k = n-1 and CD parameter d = n. It was further demonstrated
that the DPR-network is unique in achieving the maximum value n-1 of the FT
parameter.
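
As a worked check, the bounds can be evaluated for a given n and compared with the parameters quoted above for the MISE and DPR designs. This sketch takes the bounds to be 0 ≤ k ≤ n − 1 and ⌊log2 n⌋ + 1 ≤ d ≤ n; the function names are hypothetical and only restate the formulas in the text.

```python
def ft_cd_bounds(n):
    """Return ((k_min, k_max), (d_min, d_max)) for a beta-network with n elements."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    floor_log2 = n.bit_length() - 1   # exact floor(log2 n) for positive integers
    return (0, n - 1), (floor_log2 + 1, n)

def mise_parameters(n):
    """(k, d) quoted in the text for the MISE network."""
    return 1, (n.bit_length() - 1) + 1

def dpr_parameters(n):
    """(k, d) quoted in the text for the DPR network."""
    return n - 1, n

if __name__ == "__main__":
    n = 8
    (k_min, k_max), (d_min, d_max) = ft_cd_bounds(n)
    for name, (k, d) in (("MISE", mise_parameters(n)), ("DPR", dpr_parameters(n))):
        inside = k_min <= k <= k_max and d_min <= d <= d_max
        print(name, "k =", k, "d =", d, "within bounds:", inside)
```

For n = 8 the MISE design sits at the lower end of the delay bound and the DPR design at the upper end of both bounds, which is the trade-off the text describes.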

These results shed considerable light on aspects of β-network behavior which
are often not obvious from the network structure alone. For example, Fig. 2
shows two 16x16 β-networks of very similar structure. However, their FT and
CD parameters are quite different. The network of Fig. 2a is 0-FT, as is
evident from its β-graph appearing in Fig. 2b. The β-network of Fig. 2c is
the DPR-network with n=8, hence it is 7-FT. It can easily be seen from the
corresponding β-graph in Fig. 2d that the CD parameter of this β-network
is seven.

The preceding theoretical results were applied to the analysis of various
β-network structures. Some new properties of cascaded β-networks were derived.
The FT and CD parameters were obtained for each of the following well-known
β-networks: the double-tree (DOT) network, the indirect binary m-cube (m-IBC)
network, and the Benes rearrangeable (BRS) network.

2.2 Analysis of Reconfiguration and Recovery [5]

Most previous research in fault-tolerant computer design has been concerned
either with system reliability or fault diagnosis. Other important aspects of
system behavior, notably recovery, have received little attention, even though
they play a central role in fault tolerance. In this project a new method for
analyzing recovery in fault-tolerant multiprocessor systems was developed. The
system is represented by a redundant facility graph $G_r$ in which nodes correspond
to processors and edges correspond to communication links [6]. The fault-free
nodes include nodes actively engaged in data processing and nodes acting as
standby spares. A fault is represented by the removal of a node and its
associated edges from $G_r$. Faults are tolerated by reconfiguring the active and
spare nodes in $G_r$ so that there always exists an active subnetwork that is
isomorphic to (that is, has the same interconnection structure as) a certain
minimum configuration $G_b$ called the basic system. $G_b$ can be taken as the minimum
fault-free system needed to perform a particular set of tasks.


A system $G_r$ is called k-fault-tolerant (k-FT) t-step recoverable (t-SR)
if it can recover from up to k faults by changing the states of at most t
fault-free nodes. k is clearly a measure of the amount of damage the system
can tolerate. A state change, e.g. from spare to active, typically involves
the creation of new logical paths in the system, and the transfer of status
information between the affected nodes. If t state changes are required to
recover from a particular fault, then t is an approximate measure of the
system's recovery time. Clearly t is at least equal to k.

A  case  of  particular  interest,  corresponding  to  a  class  of  systems  with 
minimum  recovery  time,  has  t  =  k.  In  such  systems  recovery  from  t  faults  is 
achieved  by  immediate  replacement  of  each  failed  node  by  a  fault-free  spare. 

$G_r$ is defined to be optimally t-SR with respect to an n-node basic system $G_b$ if

(1) $G_r$ is t-FT/t-SR with respect to $G_b$

(2) $G_r$ contains the minimum number of nodes, viz. n + t

(3) $G_r$ contains the fewest edges among all systems satisfying
conditions (1) and (2)

We have shown that the optimal t-SR realization of every $G_b$ is unique, and that
it has a relatively simple structure [5]. Figure 3a shows an example of a
basic graph $I_b$ consisting of four processors arranged in a ring. Figure 3b
shows the corresponding optimal 2-SR graph $I_b^{OPT}$. It consists of $I_b$ with two
additional spare nodes, labeled $s_1$ and $s_2$, and additional edges connecting
$s_1$ and $s_2$ to all nodes of $I_b$. Every fault graph formed by removing one or
two nodes from $I_b^{OPT}$ contains a subgraph isomorphic to $I_b$ (the 2-FT property).


Figure 3. (a) A 4-node basic system $I_b$; (b) the corresponding optimal 2-SR system $I_b^{OPT}$.

Furthermore, each such subgraph can be chosen so that it differs from the
original active subgraph in at most two nodes (the 2-SR property).
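
The 2-FT property claimed for the Figure 3 example can be confirmed by exhaustive search. The sketch below assumes the graph exactly as described in the text (a 4-node ring plus two spares, each spare joined to all four ring nodes and no edge between the spares); the node labels and function names are hypothetical.

```python
from itertools import combinations, permutations

RING = [0, 1, 2, 3]          # active nodes of the basic system
SPARES = ["s1", "s2"]        # standby spares

def build_optimal_graph():
    """Ring edges plus an edge from every spare to every ring node."""
    edges = {frozenset((RING[i], RING[(i + 1) % 4])) for i in range(4)}
    edges |= {frozenset((s, v)) for s in SPARES for v in RING}
    return set(RING) | set(SPARES), edges

def find_ring(nodes, edges, size=4):
    """Return `size` surviving nodes that form a cycle, or None."""
    for subset in combinations(sorted(nodes, key=str), size):
        first, rest = subset[0], subset[1:]
        for order in permutations(rest):
            cycle = (first,) + order
            if all(frozenset((cycle[i], cycle[(i + 1) % size])) in edges
                   for i in range(size)):
                return cycle
    return None

def check_2ft():
    """True if every removal of one or two nodes leaves a 4-node ring."""
    nodes, edges = build_optimal_graph()
    for r in (1, 2):
        for failed in combinations(sorted(nodes, key=str), r):
            alive = nodes - set(failed)
            live_edges = {e for e in edges if e <= alive}
            if find_ring(alive, live_edges) is None:
                return False
    return True

if __name__ == "__main__":
    print(check_2ft())
```

Because only two spares exist, any surviving ring found this way differs from the original active ring in at most two nodes.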

Optimal t-SR systems have the disadvantage that the number of edges
connected to some nodes, i.e., the node degree, may be very large. Since
this represents the number of parallel data paths to a processor, it is often
severely restricted by physical considerations, for example, microprocessor
pin limitations. Thus nonoptimal fault-tolerant systems with limited node
fanout are of interest. We have investigated a class of graph transformations,
called line graph transformations, which lead to t-SR designs with nodes
of lower degree than the corresponding optimal t-SR systems [5]. We have also
shown that line graph transformations greatly simplify the computation of the
parameters k and t.

2.3 Bit-Sliced Microprocessor Systems [7-12]

A major investigation of the testing requirements of bit-sliced computers
has been completed under partial AFOSR sponsorship [7-12]. Bit-sliced systems
are an important class of digital systems that have a regular array-like
structure, which is particularly attractive for VLSI technology. A bit-sliced
system is realized by interconnecting identical slices or cells to form a
one-dimensional iterative logic array (ILA). In this study the design of
bit-sliced systems that are easily testable has been investigated.

First an analytic test generation methodology for bit-sliced and related
systems was developed. For this purpose a high-level (register-level) circuit
model and a corresponding functional fault model were specified. A 1-bit
processor slice C having the main features of commercial slices was defined as
a test case. Figure 4 shows the internal organization of C. Using the high-level
circuit and fault models, a technique for deriving a complete and near-minimal
length test sequence for C was developed. The cell C was then extended to
form general k-bit slices whose internal structures more closely resemble
commercial products. It was shown that test patterns for an array of N
identical processor cells can be easily derived from the tests for an
individual cell. Furthermore, the number of test patterns needed for the
processor array is independent of the array length. It was therefore
concluded that for test generation purposes, bit-sliced processors can be
usefully modeled as C-testable ILAs, which require a constant number of test
patterns independent of array size.

Fig. 4. 1-bit processor model C
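
The idea of a constant test count can be illustrated with a much simpler ILA than the slice C. The sketch below substitutes a ripple-carry adder built from full-adder cells, which is an assumption made purely for illustration and is not the processor cell of the report. It shows that eight array-level tests apply all eight input combinations to every cell for any array length; it does not model fault propagation to observable outputs. All names are hypothetical.

```python
from itertools import product

def full_adder(a, b, c):
    """One cell: returns (sum, carry_out)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)

def c_tests(n):
    """Eight array-level tests (carry_in, a_bits, b_bits) for an n-cell adder."""
    tests = []
    # Six cell patterns leave the carry unchanged, so the same (a, b)
    # can be applied to every cell.
    for a, b, c in product((0, 1), repeat=3):
        if full_adder(a, b, c)[1] == c:
            tests.append((c, [a] * n, [b] * n))
    # The two remaining patterns, (0,0,1) and (1,1,0), flip the carry,
    # so they are applied to alternate cells in two shifted tests.
    for cin in (1, 0):
        bits = [(i + cin + 1) % 2 for i in range(n)]
        tests.append((cin, bits, bits))
    return tests

def cell_inputs_seen(n):
    """Simulate the tests and collect the input combinations seen by each cell."""
    seen = [set() for _ in range(n)]
    for cin, a_bits, b_bits in c_tests(n):
        carry = cin
        for i in range(n):
            seen[i].add((a_bits[i], b_bits[i], carry))
            carry = full_adder(a_bits[i], b_bits[i], carry)[1]
    return seen

if __name__ == "__main__":
    for n in (1, 4, 16):
        complete = all(len(s) == 8 for s in cell_inputs_seen(n))
        print(n, "cells:", len(c_tests(n)), "tests, every cell fully exercised:", complete)
```

The number of tests stays at eight as the array grows, which is the defining feature of a C-testable array.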

The property of C-testability in one-dimensional ILAs was studied in
detail [9, 10]. Basic concepts of C-testability in unilateral combinational
arrays were investigated first. C-testable arrays were characterized and
procedures to construct test patterns for such arrays were developed. A new
design method to make an ILA C-testable was proposed. C-testable arrays of
bilateral and sequential cells were also analyzed. A characterization of
C-testable bilateral combinational ILAs was obtained, as well as a design
modification scheme to make a bilateral ILA C-testable. It was shown that the
results on C-testable combinational arrays can be applied directly to a useful
class of sequential arrays.

A new testability criterion for ILAs called I-testability was introduced
[8, 10, 11]. I-testability ensures that identical test responses can be obtained
from every cell of an ILA, and thus simplifies response verification.
I-testable combinational ILAs were characterized, as well as CI-testable ILAs that
are simultaneously C- and I-testable. A design scheme for making an arbitrary
ILA CI-testable was also constructed. Finally, a design technique for realizing
self-testing bit-sliced computers based on I-testing was developed. It was
established that the family of processor arrays constructed of cell C and
its extensions is CI-testable. Using CI-testable processor arrays and other
I-testable bit-sliced circuits, a self-testing computer was designed.
Figure 5 shows the central processing unit (CPU) of this computer. The
advantages and limitations of the proposed design were analyzed and compared
to more conventional self-checking approaches that are based on coding
techniques [10, 11].

Fig. 5. A self-testing bit-sliced CPU.
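
Response verification under I-testing reduces to checking that all cells agree. The sketch below illustrates this with a ripple-carry adder standing in for the processor array, which is an assumption for illustration only. It uses a test pattern whose carry-out equals its carry-in, so every cell sees the same inputs, and a hypothetical stuck-at-0 fault on one cell's sum output. The function names are not from the source.

```python
def full_adder(a, b, c):
    """One cell: returns (sum, carry_out)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)

def i_test_responses(n, a, b, c, faulty_cell=None):
    """Apply the same (a, b) to all n cells with carry-in c and return each
    cell's sum output. If `faulty_cell` is given, that cell's sum output is
    stuck at 0 (a hypothetical fault)."""
    if full_adder(a, b, c)[1] != c:
        raise ValueError("pattern must preserve the carry so all cells see it")
    responses, carry = [], c
    for i in range(n):
        s, carry = full_adder(a, b, carry)
        responses.append(0 if i == faulty_cell else s)
    return responses

def responses_agree(responses):
    """Comparator: True when every cell produced the same response."""
    return len(set(responses)) == 1

if __name__ == "__main__":
    print(responses_agree(i_test_responses(8, 1, 0, 0)))                  # fault-free
    print(responses_agree(i_test_responses(8, 1, 0, 0, faulty_cell=2)))   # one bad cell
```

A disagreement flags a fault without storing the expected response for each cell, which is the simplification the I-testability criterion is meant to provide.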

In conjunction with a VLSI design course at USC in Spring 1980, we carried
out the complete IC chip design of a 4-bit microprocessor composed of four
copies of the slice C [11]. This design was done using the software design
tools developed at Caltech and Xerox Corporation [16]. The resulting chip was
fabricated using NMOS technology in Summer 1980 as part of a multi-university
VLSI project sponsored by ARPA and Xerox. Considerable insight into the
problems of VLSI design was obtained from this work, as well as a much better
understanding of the limitations imposed by IC technology on the testability
of complex computer circuits.

2.4  Testing  General  LSI/VLSI  Systems  [13-15] 

Most existing analytical tools are inadequate for dealing with digital
components above the gate and flip-flop levels, which correspond to small-scale
integration (SSI) in current technology. There is at present no adequate
theory for the design or testing of LSI/VLSI devices, although the need for
such a theory has long been recognized [15].

We have observed that a significant property of components at all
complexity levels is expansibility, which is the ability of components of a given
type to be interconnected in a systematic way to form larger components of the
same type [13]. The larger component performs the same operation as its
constituent elements, but processes more and/or bigger operands. Expansibility
plays a particularly important role in the architecture of microcomputers.
The major design problems revolve around the number, size and interconnections
of the ROMs, RAMs and I/O interface circuits used, problems which are
intimately associated with the expansibility of these components. With bit-slice
architecture, the CPU (microprocessor) becomes an expandable design component.
Two main expansion techniques have been identified, expansion by composition
and by replication. It has been demonstrated that expansion methods can be
concisely defined by recursive logic equations. We have shown that most
standard components can be expanded using sets of recursive equations called
FS2 algorithms which allow neither feedback nor constant input/output values,
and which require at most two logic levels. Several other useful expansion
methods have also been identified [13].
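
A familiar instance of defining expansion by a recursive equation is building a large multiplexer from smaller ones. The sketch below is a generic illustration chosen for this purpose; it is not claimed to be an FS2 algorithm or one of the expansion methods of [13], and the function names are hypothetical.

```python
def mux2(x0, x1, s):
    """Basic 2-input multiplexer: returns x0 when s == 0 and x1 when s == 1."""
    return (x0 & (1 - s)) | (x1 & s)

def mux(inputs, selects):
    """2^k-input multiplexer expanded recursively from mux2 components.

    `selects` lists the k select bits, least significant first. The recursive
    equation is MUX(x, s) = mux2(MUX(x_low, s'), MUX(x_high, s'), s_msb),
    where s' is s without its most significant bit.
    """
    if len(inputs) != 2 ** len(selects):
        raise ValueError("need exactly 2**k inputs for k select bits")
    if not selects:
        return inputs[0]
    half = len(inputs) // 2
    low = mux(inputs[:half], selects[:-1])
    high = mux(inputs[half:], selects[:-1])
    return mux2(low, high, selects[-1])

if __name__ == "__main__":
    data = [0, 1, 1, 0, 1, 0, 0, 1]
    for index in range(8):
        select_bits = [(index >> i) & 1 for i in range(3)]
        print(index, mux(data, select_bits))
```

The expanded component performs the same selection operation as its constituent elements on a larger set of operands, which is the sense of expansibility used in the text.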

A  new  approach  to  processing  the  very  large  amounts  of  test  and  response 
data  associated  with  very  complex  digital  systems  such  as  VLSI  circuits  was 
developed  [14].  This  approach  treats  complex  digital  signals,  called  vector 
sequences,  as  primitive  elements;  this  implies  that  complex  subcircuits  can 
also  be  treated  as  primitive.  New  operators  for  manipulating  vector  sequences 
were  discovered,  and  their  basic  properties  were  investigated.  We  have  shown 
that  substantial  compression  of  test  information  is  achievable  using  the  vector 
sequence  approach. 

The elements of this testing approach are sequences of digital signals
appearing in the input/output lines of logic components or circuits. Such a
sequence is represented by a 2-dimensional matrix which is called a vector
sequence. For example, the binary test sequence applied to a 5-input circuit
in six clock periods might be denoted by the following 5x6 vector sequence
$$
S_{in} = \begin{bmatrix}
0 & 0 & 1 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix} \tag{1}
$$


The horizontal dimension of this matrix represents time suitably quantized,
while each position in the vertical or space dimension corresponds to a
distinct line or bus L in the circuit under consideration. The quantization
of the vertical dimension corresponds directly to the complexity level of the
information units and circuit components being used. Any submatrix of a
vector sequence can be represented by a primitive symbol. For example, if
[submatrix definitions missing from the source]
then we can replace (1) by


(2) 


In (2) $S_{in}$ is represented by a 2x2 matrix, which corresponds to a higher-level
view of $S_{in}$ than (1). The highest level occurs when $S_{in}$ is treated as a 1x1 vector
sequence, that is, as a single primitive signal.
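
The change of view from a bit-level matrix to a 2x2 matrix of primitive symbols amounts to partitioning the vector sequence into blocks. The sketch below uses the 5x6 sequence of (1); the cut positions (after row 2 and after column 3) are an arbitrary assumption, since the partition used in the source is not recoverable, and the function names are hypothetical.

```python
# The 5x6 vector sequence of (1): rows are lines, columns are clock periods.
S_IN = [
    [0, 0, 1, 0, 0, 1],
    [1, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0],
]

def partition(matrix, row_cut, col_cut):
    """Split a vector sequence into a 2x2 grid of submatrices."""
    top, bottom = matrix[:row_cut], matrix[row_cut:]
    return [
        [[r[:col_cut] for r in top], [r[col_cut:] for r in top]],
        [[r[:col_cut] for r in bottom], [r[col_cut:] for r in bottom]],
    ]

def assemble(blocks):
    """Rebuild the bit-level matrix from a grid of submatrices."""
    rows = []
    for block_row in blocks:
        for parts in zip(*block_row):
            rows.append([bit for part in parts for bit in part])
    return rows

if __name__ == "__main__":
    blocks = partition(S_IN, row_cut=2, col_cut=3)   # assumed cut positions
    print(blocks[0][0])                 # upper-left submatrix as one "symbol"
    print(assemble(blocks) == S_IN)     # the two views carry the same data
```

Each of the four blocks can then be named by a single symbol, giving the 2x2 view, and treating the whole matrix as one block gives the 1x1 view.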

We have defined a set of fundamental operators, denoted {•, ¢, ×, ⊗},
for processing vector sequences. The operators • and ¢ represent concatenation
(external expansion) in the time and space dimensions, respectively, while ×
and ⊗ denote new operations called internal expansion. We have also defined
a set of "standard" vector sequences from which the test data, both input
patterns and output responses, for a variety of complex circuits can be
constructed. We have shown that the vector sequence approach is applicable
to logic circuits at all complexity levels, from gate-level circuits to
microprocessor-based systems. We are continuing to work on the development
of a test generation algorithm, analogous to the D-Algorithm, in which the
test data is represented by vector sequences.
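
The two concatenation operators can be sketched directly on the matrix representation, with time running along columns and space along rows as described earlier. This is a minimal illustration using nested lists; the function names are hypothetical, and the internal expansion operators are omitted because the source does not define them.

```python
def concat_time(u, v):
    """Concatenation in time: v follows u on the same lines (columns appended)."""
    if len(u) != len(v):
        raise ValueError("both sequences must have the same number of lines")
    return [row_u + row_v for row_u, row_v in zip(u, v)]

def concat_space(u, v):
    """Concatenation in space: the lines of v are placed below those of u."""
    if u and v and len(u[0]) != len(v[0]):
        raise ValueError("both sequences must span the same number of periods")
    return u + v

if __name__ == "__main__":
    a = [[0, 1], [1, 1]]       # 2 lines, 2 clock periods
    b = [[1], [0]]             # 2 lines, 1 clock period
    c = [[0, 0, 1]]            # 1 line, 3 clock periods
    longer = concat_time(a, b)           # 2 lines, 3 clock periods
    print(longer)
    print(concat_space(longer, c))       # 3 lines, 3 clock periods
```

Building long or wide test sequences from a few short standard ones in this way is one route to the compression of test information mentioned above.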